Search for: All records

Creators/Authors contains: "Behnam, Payman"


  1. Deep neural networks are increasingly required to operate across diverse hardware platforms, latency constraints, and power budgets, which motivates specialized models for each scenario. However, designing and training a separate model per scenario, or serving a large ensemble of models, is often impractical. Weight sharing has emerged as a promising paradigm to address this challenge by training a single "SuperNet" that subsumes many sub-models (SubNets) and by reusing weights across those SubNets at both training and inference time. This paper provides an abridged survey of our recent advances that leverage weight sharing for efficient AI, covering both training and inference serving. In centralized once-for-all training, Delayed ε-Shrinking (DεS) improves training efficiency by strategically scheduling the introduction of smaller SubNets during training. In the federated setting, SuperFedNAS co-trains a SuperNet across distributed clients and decouples training from search, enabling one-shot specialization to many deployment targets at minimal cost. ∇QDARTS integrates quantization into differentiable architecture search, jointly finding neural architectures, weights, and low-precision settings to yield highly efficient models in a single search. For inference serving, SuperServe introduces a weight-shared model with dynamic SubNet routing (SubNetAct) to instantaneously switch among a spectrum of accuracy-latency operating points, coupled with a scheduler (SlackFit) for unpredictable workloads. Finally, SUSHI co-designs the model, system, and accelerator to exploit weight-shared SuperNets on tinyML devices, caching SubGraphs on an FPGA to reduce latency and energy. Together, these works demonstrate that the weight-sharing paradigm can dramatically improve the efficiency of both training and inference serving of deep models across a range of scenarios.
    Free, publicly-accessible full text available August 4, 2026
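To make the weight-sharing mechanism described above concrete, here is a minimal illustrative sketch (not code from the paper; the layer class, names, and shapes are hypothetical) of how a SubNet can reuse a slice of a SuperNet's shared weight tensor rather than storing its own copy:

```python
import numpy as np

# Minimal illustrative sketch (hypothetical class, names, and shapes): a
# SuperNet layer owns the largest weight tensor, and every SubNet reuses a
# slice of it instead of keeping separately trained weights.
class SharedConvLayer:
    def __init__(self, max_out_ch, max_in_ch, k=3):
        # Single shared parameter tensor owned by the SuperNet.
        self.weight = np.random.randn(max_out_ch, max_in_ch, k, k)

    def subnet_weight(self, out_ch, in_ch):
        # A smaller SubNet views the leading slice of the shared tensor.
        return self.weight[:out_ch, :in_ch, :, :]

layer = SharedConvLayer(max_out_ch=64, max_in_ch=32)
small = layer.subnet_weight(out_ch=16, in_ch=8)   # one of many SubNets
full = layer.subnet_weight(out_ch=64, in_ch=32)   # the full SuperNet layer
print(small.shape, full.shape)                    # (16, 8, 3, 3) (64, 32, 3, 3)
```

In an actual once-for-all setup, many such per-layer slices define the SubNet population whose introduction a schedule such as DεS could control during training.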
  2. Neural Architecture Search (NAS) for Federated Learning (FL) is an emerging field. It automates the design and training of Deep Neural Networks (DNNs) when data cannot be centralized due to privacy, communication costs, or regulatory restrictions. Recent federated NAS methods not only reduce manual effort but also achieve higher accuracy than traditional FL methods such as FedAvg. Despite this success, existing federated NAS methods still fall short of satisfying the diverse deployment targets common in on-device inference, including hardware, latency budgets, and variable battery levels. Most federated NAS methods search over only a limited range of neuro-architectural patterns and repeat them throughout a DNN, thereby restricting achievable performance. Moreover, these methods incur prohibitive training costs because they repeat the training and search of DNN architectures for each deployment target. SuperFedNAS addresses these challenges by decoupling training and search in federated NAS. SuperFedNAS co-trains a large number of diverse DNN architectures contained inside one supernet in the FL setting. Post-training, clients perform NAS locally to find specialized DNNs by extracting different parts of the trained supernet with no additional training. SuperFedNAS takes O(1) (instead of O(N)) cost to find specialized DNN architectures in FL for any N deployment targets. As part of SuperFedNAS, we introduce MaxNet, a novel FL training algorithm that performs multi-objective federated optimization of DNN architectures (≈5×10⁸) under different client data distributions. SuperFedNAS achieves up to 37.7% higher accuracy or up to an 8.13x reduction in MACs compared with existing federated NAS methods.
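The O(1)-versus-O(N) point can be illustrated with a toy post-training search loop (a sketch under assumed, hypothetical proxy estimators; this is not the SuperFedNAS or MaxNet algorithm): once the supernet is trained, each of N deployment targets needs only a cheap local search over already-trained SubNet configurations, with no further training.

```python
import random

# Toy sketch only: the latency/accuracy estimators below are hypothetical
# stand-ins (lookup tables or learned predictors in practice).
def estimate_latency_ms(cfg):
    return 0.5 * cfg["depth"] * cfg["width"] / 16

def estimate_accuracy(cfg):
    return 0.60 + 0.010 * cfg["depth"] + 0.002 * cfg["width"]

def local_search(latency_budget_ms, n_samples=500):
    """Search already-trained SubNet configs; no training happens here."""
    best = None
    for _ in range(n_samples):
        cfg = {"depth": random.randint(8, 20),
               "width": random.choice([16, 24, 32, 48])}
        if estimate_latency_ms(cfg) > latency_budget_ms:
            continue
        if best is None or estimate_accuracy(cfg) > estimate_accuracy(best):
            best = cfg
    return best

# N deployment targets -> N cheap searches, but only one supernet training run.
for budget_ms in (5.0, 10.0, 20.0):
    print(budget_ms, local_search(budget_ms))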
  3. Tiny machine learning (TinyML) applications increasingly operate in dynamically changing deployment scenarios, requiring optimization for both accuracy and latency. Existing methods mainly target a single point in the accuracy/latency tradeoff space, which is insufficient because no single static point can be optimal under variable conditions. We draw on a recently proposed weight-shared SuperNet mechanism to serve a stream of queries that activate different SubNets within a SuperNet. This creates an opportunity to exploit the inherent temporal locality across queries that use the same SuperNet. We propose a hardware–software co-design called SUSHI that introduces a novel SubGraph Stationary optimization. SUSHI consists of a novel field-programmable gate array (FPGA) implementation and a software scheduler that controls which SubNets to serve and which SubGraph to cache in real time. SUSHI yields up to a 32% improvement in latency and a 0.98% increase in served accuracy, and saves up to 78.7% of off-chip energy across several neural network architectures.
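The SubGraph Stationary idea can be illustrated with a toy serving loop (hypothetical SubNet and block names; this is not the actual SUSHI scheduler): queries served by SubNets that share a cached SubGraph prefix avoid recomputing or reloading that prefix.

```python
# Toy sketch (hypothetical SubNet/block names): consecutive queries whose
# SubNets share a cached SubGraph prefix skip reloading that prefix.
SUBNETS = {
    "small":  ["blk1", "blk2"],
    "medium": ["blk1", "blk2", "blk3"],
    "large":  ["blk1", "blk2", "blk3", "blk4"],
}

cached_prefix = []  # SubGraph currently resident (e.g. cached on the FPGA)

def serve(subnet_name):
    global cached_prefix
    blocks = SUBNETS[subnet_name]
    # Count how much of the requested SubNet is already cached.
    reused = 0
    for cached_blk, blk in zip(cached_prefix, blocks):
        if cached_blk != blk:
            break
        reused += 1
    cached_prefix = blocks  # keep the newly used SubGraph stationary
    return subnet_name, len(blocks) - reused  # blocks that must be (re)loaded

for query in ["small", "medium", "medium", "large", "small"]:
    print(serve(query))
```

A real scheduler would also weigh each query's latency slack and accuracy target when picking the SubNet, not just prefix reuse.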
  4. Song, Dawn; Carbin, Michael; Chen, T (Ed.)
  5. As the number of weight parameters in deep neural networks (DNNs) continues to grow, the demand for ultra-efficient DNN accelerators has motivated research on non-traditional architectures built with emerging technologies. Resistive Random-Access Memory (ReRAM) crossbars have been utilized to perform in-situ matrix-vector multiplication for DNNs. DNN weight pruning techniques have also been applied to ReRAM-based mixed-signal DNN accelerators, focusing on reducing weight storage and accelerating computation. However, existing works account for very few peripheral-circuit features, such as analog-to-digital converters (ADCs), during neural network design. Unfortunately, ADCs have become the dominant contributor to the power consumption and area cost of current mixed-signal accelerators, and the large overhead of these peripheral circuits has not been addressed efficiently. To address this problem, we propose a novel weight pruning framework for ReRAM-based mixed-signal DNN accelerators, named TINYADC, which effectively reduces the required ADC resolution (in bits) and hence the overall area and power consumption of the accelerator without introducing any computational inaccuracy. Compared to state-of-the-art pruning work on the ImageNet dataset, TINYADC achieves 3.5× and 2.9× power and area reduction, respectively. The TINYADC framework improves the throughput of a state-of-the-art architecture design by 29% and 40% in terms of throughput per unit area and per watt (GOPs/mm² and GOPs/W), respectively.
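To see why pruning can lower ADC resolution, a back-of-the-envelope sketch follows (an assumption-laden simplification, not the TINYADC formulation): with 1-bit inputs and 1-bit weights, the ADC must resolve a bit-line sum that grows with the number of active rows per crossbar column, so bounding the active (non-pruned) rows per column reduces the required ADC bits.

```python
import math

# Simplified, hypothetical numbers: the worst-case bit-line partial sum with
# binary inputs and weights equals the number of active rows per column, and
# the ADC needs enough bits to resolve that sum.
def required_adc_bits(active_rows_per_column):
    max_partial_sum = active_rows_per_column  # 1-bit weights x 1-bit inputs
    return math.ceil(math.log2(max_partial_sum + 1))

for rows in (128, 64, 32, 16):
    print(f"{rows:3d} active rows/column -> {required_adc_bits(rows)}-bit ADC")
```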
  6. Over the past decade, three-dimensional die-stacking technology has been considered for building large-scale in-package memory systems. In particular, in-package DRAM cache has been considered a promising solution for high-bandwidth, large-scale cache architectures. There are, however, significant challenges, such as limited energy efficiency, costly tag management, and physical limits on scalability, that need to be addressed effectively before in-package caches can be adopted in real-world applications. This paper proposes R-Cache, an in-package cache built by 3D die stacking of memristive memory arrays to alleviate the above-mentioned challenges. Our simulation results on a set of memory-intensive parallel applications indicate that R-Cache outperforms state-of-the-art proposals for in-package caches. R-Cache improves performance by 38% and 27% over the state-of-the-art direct-mapped and set-associative cache architectures, respectively. Moreover, R-Cache reduces energy by averages of 40% and 27% compared to the direct-mapped and set-associative cache systems.